The data is extracted from the 1994 US Census database and was found at the UCI ML repository: https://archive.ics.uci.edu/ml/datasets/Adult
I will try to analyze how different sociodemographic indicators affect the likelihood of a person earning more than $50,000 a year.
Comments on each code chunk are given before the chunk.
First, let’s import the dataset and format it a bit for easier exploration.
Initialize, read file, assign column names.
Change ‘?’ values to proper NA objects, and drop unused levels.
Arrange education levels by the provided ‘education_num’ variable.
Arrange other factors by frequency of high income (High Salary Ratio, HSR) - from lowest to highest.
Remove unneeded columns.
Group very low frequency workclass levels together.
Remove spaces from factor levels.
Add some handy shortcuts to ggplot functions.
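The cleaning steps listed above can be illustrated with a minimal sketch. It is written in plain Python rather than the R used for this report, so it is not the original chunk. The column order follows the UCI file description, `parse_adult_line` is a hypothetical helper, and recoding the raw `<=50K` / `>50K` labels to `low` / `high` is an assumption based on the level names used later in the text.

```python
# Column order as documented for the UCI Adult file (underscored names as used in this report).
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education_num",
    "marital_status", "occupation", "relationship", "race", "sex",
    "capital_gain", "capital_loss", "hours_per_week", "native_country",
    "income",
]
# Columns the report removes; education_num would be used for ordering before being dropped.
DROPPED = {"fnlwgt", "education_num", "capital_gain", "capital_loss"}
NUMERIC = {"age", "hours_per_week"}
# Assumed recoding of the raw income labels.
INCOME_LABELS = {"<=50K": "low", ">50K": "high"}


def parse_adult_line(line):
    """Turn one comma-separated record into a dict; '?' becomes None."""
    values = [v.strip() for v in line.split(",")]  # also removes the spaces after commas
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(values)}")
    row = {}
    for name, value in zip(COLUMNS, values):
        if name in DROPPED:
            continue
        if value == "?":
            row[name] = None
        elif name in NUMERIC:
            row[name] = int(value)
        elif name == "income":
            row[name] = INCOME_LABELS[value]
        else:
            row[name] = value
    return row


sample = ("39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
          "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K")
record = parse_adult_line(sample)
missing = parse_adult_line(sample.replace("State-gov", "?"))
```

Here `record` keeps 11 fields, matching the 11 variables used in the rest of the analysis, and `missing["workclass"]` is `None`.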
I’ll refer to high salary ratio of a group (number of people from the group having high income, divided by group size) as HSR.
HSR indicates the income level of the group.
The HSRs for the levels of each variable are stored in the adult_by[[variable]] list.
HSR of the total population is 0.24.
## [1] 0.2408096
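The HSR definition above (high-income members divided by group size) can be sketched as follows. This is an illustration in plain Python, not the R code behind `adult_by`. The function name `hsr_by_level`, the toy rows, and the choice to skip missing values are assumptions.

```python
from collections import defaultdict


def hsr_by_level(rows, variable):
    """Return {level: HSR} for one variable, ordered from lowest to highest HSR."""
    high = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        level = row[variable]
        if level is None:  # assumption: missing values are left out
            continue
        total[level] += 1
        if row["income"] == "high":
            high[level] += 1
    ratios = {level: high[level] / total[level] for level in total}
    # Sorting by HSR mirrors the factor ordering used in the plots.
    return dict(sorted(ratios.items(), key=lambda item: item[1]))


# Toy rows, invented for illustration only.
toy_rows = [
    {"sex": "Male", "income": "high"},
    {"sex": "Male", "income": "high"},
    {"sex": "Male", "income": "low"},
    {"sex": "Male", "income": "low"},
    {"sex": "Female", "income": "high"},
    {"sex": "Female", "income": "low"},
    {"sex": "Female", "income": "low"},
    {"sex": "Female", "income": "low"},
]
hsr_sex = hsr_by_level(toy_rows, "sex")
```

On the toy rows, `hsr_sex` lists `Female` (0.25) before `Male` (0.5), the same lowest-to-highest ordering used for the factor levels.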
The following HSR plots actually belong to the bivariate section, but I chose to put them together with the histograms for easier interpretation.
Age
75% of the people are under 50, with mean = 38.58 and median = 37.
HSR increases from age 17 to about 50, then declines.
For some reason, ages 79 and 83 have very high HSR - these are probably outliers, but there’s no evidence in support of removing them.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17.00 28.00 37.00 38.58 48.00 90.00
Workclass
70% of the sample work in the Private sector, which also has the lowest HSR of 22% (I don’t count ‘No_pay’ as it only has 22 members and obviously 0 HSR).
56% of Self-emp-inc (probably company owners) have high income, federal government staff are also paid rather well.
Education
71% of the people are HS-grads, Some-college or Bachelors.
I arranged education by the (natural) education level, so we see that higher educated people earn more.
There are 3 distinctive education groups in terms of income:
HS-dropouts (HSR = 0…7%),
HS-grad to Assoc-acdm (HSR = 16…26%),
and people with at least Bachelors degree (HSR = 41…74%).
Marital status
78% of the people are Married-civ-spouse or Never-married.
Married-civ-spouse and Married-AF-spouse (AF stands for Armed Forces) have the highest HSR (around 44%).
About 1/3 of the respondents were never married, and they have the lowest HSR (4.6%).
85% of high-income people have the Married-civ-spouse status.
## [1] "Part of the high income people that have Married-civ-spouse marital status:"
## [1] 0.8534626
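Note that this figure is a different quantity from HSR: it is the share of all high-income people who fall into a given category, not the share of the category that has high income. A minimal sketch of the difference, in plain Python with a hypothetical helper name and invented toy rows:

```python
def share_of_high_income(rows, variable, levels):
    """Fraction of high-income rows whose `variable` is one of `levels`."""
    high_rows = [row for row in rows if row["income"] == "high"]
    if not high_rows:
        raise ValueError("no high-income rows")
    hits = sum(1 for row in high_rows if row[variable] in levels)
    return hits / len(high_rows)


# Toy rows, invented for illustration only.
toy_rows = [
    {"marital_status": "Married-civ-spouse", "income": "high"},
    {"marital_status": "Married-civ-spouse", "income": "high"},
    {"marital_status": "Married-civ-spouse", "income": "high"},
    {"marital_status": "Married-civ-spouse", "income": "low"},
    {"marital_status": "Never-married", "income": "high"},
    {"marital_status": "Never-married", "income": "low"},
]
married_share = share_of_high_income(toy_rows, "marital_status", {"Married-civ-spouse"})
```

In the toy data 3 of the 4 high-income rows are Married-civ-spouse, so `married_share` is 0.75, while the HSR of that group is also 0.75 only by coincidence (3 of its 4 members).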
Occupation
About 25% of the people are Prof-specialty or Exec-managerial - the two highest-HSR categories (HSR = 45-48%).
Priv-house-serv occupation has HSR of only 0.6%.
Relationship
We have a very small number of wives compared to husbands.
Wives and Husbands have very high HSR (45-47%).
Wives have even higher HSR than Husbands, even though women overall have an HSR of 11% and men of 30.5%.
As with marital_status, 85% of high-income people are Husbands or Wives.
## [1] "Part of the high income people that are Husbands or Wives"
## [1] 0.8497641
Race
85% of the people are white.
Black, Native-American and Other have about half the HSR of White and Asian (9-12% compared to 25%).
Sex
For some reason, there are twice as many men as women in the survey, and men are also paid much better:
Male - 67%, Female - 33%.
HSR for men - 31%, while HSR for women - 11%.
Hours per week
47% of the sample work 40 hours, so for the histogram let’s use a log scale.
Surprisingly, the top HSR is at about 60 hpw (probably because the top-paid executives don’t work long hours), and people working 100 hpw have about the same HSR as standard 40-hpw people.
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
Native country
90% of the people are US-natives, so let’s use log scale.
Apart from the US, there are 643 people from Mexico; no other country has more than 200 people.
Most of the low-HSR countries are in the Caribbean and Latin American regions.
US-natives are somewhere in the middle, and the top of the list are mostly developed countries, not including Iran, which surprisingly has the highest HSR.
There are 32561 people of ages 17-90.
There are 11 variables in my dataset: age, workclass, education, marital_status, occupation, relationship, race, sex, hours_per_week, native_country and income.
The variables age and hours_per_week are integer variables, the other variables are factors.
Medians for numerical data or modes (most frequent levels) for factors:
age: 37
workclass: Private
education: HS-grad > Some-college > Bachelors
marital_status: Married-civ-spouse > Never-married
occupation: Prof-specialty, Exec-managerial, Craft-repair, Adm-clerical, Sales, Other-service
relationship: Husband > Not-in-family
race: White
sex: Male
hours_per_week: 40
native_country: United-States
income: low - 76%, high - 24%
The main feature that I’m interested in and want to relate to other variables is income.
The main features that I expect to be influencing income are sex, education, workclass, marital_status and occupation.
I removed these features:
fnlwgt - constructed variable (by the census takers), meaning of the variable unclear.
education_num - duplicating education, I took education ordering from it.
capital_gain, and capital_loss - present only for small part of the data.
I expect all other features to be of interest.
I didn’t create any additional variables.
The NA values were present as ‘?’, changed it to standard NA.
I’ve made the ‘adult_by’ list, which contains HSR values for each level of each variable.
Factor level ordering:
I’ve ordered education by education_num variable.
As there is no intrinsic ordering for other factors, I’ve ordered them by HSR.
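Deriving the education order from education_num can be sketched as below. This is an illustration in plain Python, not the R code used here. `order_levels` is a hypothetical helper, and the pairs shown are a small subset with the numeric codes as they appear in the UCI file.

```python
def order_levels(pairs):
    """Return level names sorted by their numeric rank, with duplicates removed."""
    rank = {}
    for level, number in pairs:
        # Each level must always map to the same number, otherwise the order is ambiguous.
        if rank.setdefault(level, number) != number:
            raise ValueError(f"conflicting ranks for {level!r}")
    return sorted(rank, key=rank.get)


# (education, education_num) pairs as they could be observed row by row.
observed = [
    ("Bachelors", 13),
    ("HS-grad", 9),
    ("Doctorate", 16),
    ("HS-grad", 9),
    ("Preschool", 1),
]
education_order = order_levels(observed)
```

The same helper would work for the HSR-based ordering of the other factors if each level were paired with its HSR instead of education_num.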
As for distributions, we have many more men than women.
People of ages 79 and 83 have unusually high HSR.
The HSR of wives is a bit higher than that of husbands, while the HSR of women is much lower than that of men.
As a curious addition, we have 2 male wives and 1 female husband. These are probably errors.
Women are younger than men by about 3 years.
Women: median - 35, mean - 36.86.
Men: median 38, mean - 39.43.
## adult$sex: Female
## [1] 36.85823
## --------------------------------------------------------
## adult$sex: Male
## [1] 39.43355
## adult$sex: Female
## [1] 35
## --------------------------------------------------------
## adult$sex: Male
## [1] 38
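The grouped means and medians printed above can be reproduced with a computation of this shape. The sketch uses plain Python rather than the R call that produced the output, and both the helper name `summarize_by` and the toy ages are invented for illustration.

```python
from collections import defaultdict
from statistics import mean, median


def summarize_by(rows, group_key, value_key):
    """Return {group: {"mean": ..., "median": ...}} for one numeric column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {
        group: {"mean": mean(values), "median": median(values)}
        for group, values in groups.items()
    }


# Toy rows, invented for illustration only.
toy_rows = [
    {"sex": "Female", "age": 30},
    {"sex": "Female", "age": 40},
    {"sex": "Male", "age": 35},
    {"sex": "Male", "age": 45},
    {"sex": "Male", "age": 46},
]
age_by_sex = summarize_by(toy_rows, "sex", "age")
```

The same helper applies to the later comparisons of age and hours_per_week between the low and high income groups.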
High-income people are older than low-income people by about 7.5 years on average (10 years by median).
High income: mean - 44.25, median - 44.
Low income: mean - 36.78, median - 34.
## adult$income: low
## [1] 36.78374
## --------------------------------------------------------
## adult$income: high
## [1] 44.24984
## adult$income: low
## [1] 34
## --------------------------------------------------------
## adult$income: high
## [1] 44
Looks like most of the low income is in low hours per week.
Let’s see how much people with different education work.
There’s a definite curve in the mean hpw (red diamonds).
On average, people with only Preschool education work 36.46 hours while people with Doctorate degree work 46.97 hours.
All the races look pretty similar; only Asian-Pac-Islander has slightly fewer HS-grads and slightly more Bachelors (and other higher education levels).
This is natural as Asian-Pac-Islander is the highest-HSR race.
HSR for men with high education is very high.
For both genders there is a breaking point at HS-grad (better seen on women) - 96% of high-income people are HS-grad or higher. (note that the scale is logarithmic)
While the median is 40 hpw for both groups (40 hpw is standard), high-income people on average work 6.6 more hpw than low-income people.
## adult$income: low
## [1] 38.84021
## --------------------------------------------------------
## adult$income: high
## [1] 45.47303
High-income people are on average 7.5 years older.
Median age difference is 10 years.
## adult$income: low
## [1] 36.78374
## --------------------------------------------------------
## adult$income: high
## [1] 44.24984
## adult$income: low
## [1] 34
## --------------------------------------------------------
## adult$income: high
## [1] 44
99.9% of the husbands and wives (highest HSR relationships) are married-civ-spouse(highest HSR marital status).
98.4% of Married-civ-spouse are husbands or wives.
89% of ‘Own-child’ are ‘Never-married’.
85% of high-income people have the ‘Married-civ-spouse’ marital status.
76% of high-income people have the ‘Husband’ relationship.
Women are much more likely than men to be in ‘Never-married’/‘Widowed’/‘Divorced’/‘Separated’ marital statuses and ‘Unmarried’/‘Not-in-family’ relationships, and much less likely to be Wives than men are to be Husbands.
Thus, we can conclude that men in this sample are much more likely to be married than women. We can only wonder who they are married to (as the data is from 1994).
Younger people are mostly Never-married, middle-aged are Married or Divorced.
Older people are mostly Married, Divorced or Widowed.
The youngest marital_status is ‘Never-married’, the oldest is ‘Widowed’. No surprises.
Most of the younger people are Own-child.
Middle-aged and older people are Husbands, Not-in-family or Unmarried.
As we’ve seen, the number of husbands is much higher than the number of wives.
‘Own-child’ is the youngest, ‘Husband’ is the oldest, ‘Wife’ is a bit younger. As with marital_status, no surprises.
49% of high-income people come from the 2 highest-paid occupations - Prof-specialty and Exec-managerial.
The 3rd- and 4th-highest-paid occupations - Craft-repair and Sales - account for another 24% of high income.
The people in Self-emp workclasses are mostly men.
NA category has the highest women/men ratio.
The majority of the difference in male-female populations is due to most common ‘White’ race.
Black race has very high woman/man ratio - 1569 men to 1555 women.
Let’s see which occupations are dominated by either sex.
Remember that the occupations are arranged by HSR.
The highest female ratio is in the lowest-paid occupation (‘Priv-house-serv’).
Other female occupations: Adm-clerical and Other-service.
Male occupation: Handlers-cleaners, Armed-forces, Transport-moving, Craft-repair, Protective-serv.
Pretty much as expected.
Let’s now look at the correlations between features.
The correlation matrix is made as follows:
Factor/Factor - Cramer’s V, Factor/Numerical - eta (ANOVA), Numerical/Numerical - Pearson’s r.
The map is very approximate, as there are different correlation measures, but we can nevertheless make some conclusions.
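For reference, the three association measures named above can be sketched as follows. This is plain Python for illustration, not the R code that produced the matrix. The function names are hypothetical, and whether the original used a bias-corrected Cramér’s V is unknown, so the uncorrected form is assumed.

```python
import math
from collections import Counter, defaultdict


def cramers_v(xs, ys):
    """Cramér's V for two categorical sequences (uncorrected form)."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    x_counts = Counter(xs)
    y_counts = Counter(ys)
    chi2 = 0.0
    for x, nx in x_counts.items():
        for y, ny in y_counts.items():
            expected = nx * ny / n
            chi2 += (joint[(x, y)] - expected) ** 2 / expected
    k = min(len(x_counts), len(y_counts)) - 1
    return math.sqrt(chi2 / (n * k)) if k > 0 else 0.0


def correlation_ratio(categories, values):
    """Eta: square root of between-group over total sum of squares (one-way ANOVA)."""
    groups = defaultdict(list)
    for category, value in zip(categories, values):
        groups[category].append(value)
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    return math.sqrt(ss_between / ss_total) if ss_total > 0 else 0.0


def pearson_r(xs, ys):
    """Pearson's r for two numeric sequences with non-zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Tiny invented inputs: perfectly associated, independent, and perfectly linear cases.
v_perfect = cramers_v(["a", "a", "b", "b"], ["x", "x", "y", "y"])
v_none = cramers_v(["a", "a", "b", "b"], ["x", "y", "x", "y"])
eta_perfect = correlation_ratio(["a", "a", "b", "b"], [1, 1, 3, 3])
r_perfect = pearson_r([1, 2, 3], [2, 4, 6])
```

Cramér’s V and eta both range from 0 to 1, while Pearson’s r ranges from -1 to 1, which is one reason the combined map should be read as approximate.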
1. Strong correlations between features
relationship - marital_status (as expected). They basically duplicate each other, so we could consider removing one of them.
relationship/marital_status - sex (as expected)
sex - occupation (we’ve explored it)
race - native country (as expected)
age - relationship/marital_status
hours_per_week correlates with most of the other features (except race/native_country and age) at an average amount.
2. Strong correlation with income
income - relationship/marital_status
income - education
income - occupation
Income somewhat correlates with most of the features.
3. Weak correlations
Race and native_country don’t correlate with anything besides themselves, and they have lowest correlation with income. We should consider removing both of them.
As we know, about 90% of the people are White and US-native, 92% of US-natives are White and 88% of White are US-natives, so these features aren’t really helpful.
Education mainly correlates with income, and by a smaller amount with age, occupation, and hours_per_week. We don’t see significant correlation with sex, race/native_country or relationship.
Age does not correlate with sex, race or native_country. As we expect these features to be independent, this indicates a rather good quality of the sample.
We still don’t see any explanation of the gender distribution.
Income strongly correlates with relationship/marital_status, occupation and education.
Income has the least correlation with race/native_country and workclass.
Native_country and race are strongly connected: 90% of people have US as native_country and 90% of people have ‘White’ race. 92% of US-natives are White and 88% of White are US-natives.
As they don’t correlate with other variables (including income), we should consider removing both.
Relationship and marital_status have a very strong relationship too and have a strong connection with sex as sex divides married people into ‘Husband’ and ‘Wife’ relationship. We should consider removing either of them (probably relationship) to avoid multicollinearity.
Also these two variables have strong connection with age - younger people are unmarried, middle aged are Married or Divorced and older people are mostly Married, Divorced or Widowed.
Occupation does have a relationship with sex, as there are occupations dominated by either sex.
The strongest relationship is between ‘sex’ and ‘relationship’. Men are much more likely to be married and women are much more likely to be unmarried, which gives rise to questions about sample quality, as we expect genders to be equally married in the population.
Women have lower HSR, and are a bit younger.
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
There are too many data points, so let’s look at the smoothed lines.
Curves for the most of education levels have hill-like shape, with maximum at about 45-55 years.
The HSR curves for each education dominate the curves for lower educations for almost every age.
The Preschool’s smoothed HSR is just flat 0.
11th, 12th and Prof-school educations have a rise on the higher age. These are probably due to the outliers of the age of 79 and 83.
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
As we’ve already seen, people with better education are older.
Also, there is a bunch of very high age outliers for the higher educations.
People with average education have a very large age difference between high and low income - up to almost 20 years.
We see that people of the lower education categories (from Preschool to about 9th grade) are older than the subsequent categories.
That is probably because these people are most likely dropouts (and they could have dropped out a while ago), while people with 10th grade education or higher could still be studying (the survey includes respondents aged 17 and older).
Men with lower education are younger than women with same education, and the opposite is true for higher education.
We see that in Own-child and Husband/Wife relationships men are older, and in other relationships women are older.
Younger/older people and women work less.
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
The age distributions of every relationship except Unmarried and Husband/Wife are similar.
There are more unmarried women than men, and there are way more husbands than wives.
The countries are ordered by HSR.
On average, low-income countries have more women (relative size of men/women is controlled by violin size).
On all of the income scale there are countries with older men (Vietnam, Taiwan), and countries with older women (Peru, England).
## Warning in max(data$n): no non-missing arguments to max; returning -
## Inf
## geom_path: Each group consist of only one observation. Do you need to adjust the group aesthetic?
The top of the income for men is at about 50 years, while for women it is about 42 years. The difference in age for genders is only 3 years.
For every education level, the highest income is at about 45-55 years.
The age distributions of all relationships except wife/husband are similar for both genders.
Women of all ages work about 5 less hours per week than men of the same age.
People with lower education are older than people with average education - this is because the first group are likely dropouts, and the second group could still be studying.
As we’ve already explored in the bivariate section, many of the occupations are dominantly male or female.
When working more than 60 hours, your income declines with more hours.
The main demographic characteristics are gender and age. Let’s look at how they interact with income in our data.
For low income, the ratio of men to women is about 3:2 across the whole age scale, and the number of people declines linearly from about 25 years.
In the high income histogram, however, there are 6 times more men than women, and the age distribution is more bell-shaped (or bimodal?) with top values at 30-50 years.
It is also interesting to explore how working hours correspond to income.
It would be natural to expect more hours to be paid better, but it is not entirely the case.
0-25 hpw are naturally paid worse (HSR ~0.1), and in the range of 25-60 hpw HSR rises up to 0.4.
But after 60 hpw, on average, HSR decreases.
People who work 100 hours a week earn about the same as people who work 40 hours.
This could be explained by the fact that most of the top paid individuals don’t work very long hours.
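The hours-versus-HSR comparison above amounts to binning hours_per_week and computing HSR per bin. A minimal sketch in plain Python (not the smoothed R plot itself) follows. The bin edges are taken from the ranges mentioned in the text, and the helper names and toy rows are assumptions.

```python
from bisect import bisect_right
from collections import defaultdict


def bin_label(hours, edges):
    """Label the integer-hours bin that `hours` falls into, given sorted lower edges."""
    index = bisect_right(edges, hours)
    if index == 0:
        return f"<{edges[0]}"
    if index == len(edges):
        return f">={edges[-1]}"
    return f"{edges[index - 1]}-{edges[index] - 1}"


def hsr_by_hours(rows, edges=(25, 61)):
    """Return {bin label: HSR}; the default bins are <25, 25-60 and >=61 hours."""
    high = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        label = bin_label(row["hours_per_week"], edges)
        total[label] += 1
        if row["income"] == "high":
            high[label] += 1
    return {label: high[label] / total[label] for label in total}


# Toy rows, invented for illustration only.
toy_rows = [
    {"hours_per_week": 20, "income": "low"},
    {"hours_per_week": 20, "income": "low"},
    {"hours_per_week": 40, "income": "high"},
    {"hours_per_week": 40, "income": "low"},
    {"hours_per_week": 60, "income": "high"},
    {"hours_per_week": 80, "income": "high"},
    {"hours_per_week": 99, "income": "low"},
]
hsr_hours = hsr_by_hours(toy_rows)
```

Coarse bins like these lose the detail of the smoothed curve, but they make it easy to check whether HSR really drops beyond 60 hours.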
Another relationship that is interesting to look at is gender distribution for different occupations, so we can see which professions are dominated by either gender.
The occupations are ordered by HSR (leftmost occupation is worst-paid).
Average female percentage is denoted by the horizontal line.
Occupations that have more women than others: Priv-house-serv (lowest-paid occupation - HSR = 0.6%), Adm-clerical and Other-service.
Male occupations: Handlers-cleaners, Armed-forces, Transport-moving, Craft-repair, Protective-serv.
The two top-paid occupations (Prof-specialty and Exec-managerial) are about 1/3 women, the same as in the whole population.
The difference in gender distribution between occupations is probably the main cause of income inequality between genders.
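The comparison behind this plot, the female share per occupation against the overall female share, can be sketched as below. This is plain Python for illustration rather than the R plotting code, with a hypothetical helper name and invented toy rows.

```python
from collections import Counter


def female_share_by(rows, variable):
    """Return (overall female share, {level: female share}) for one factor."""
    total = Counter()
    female = Counter()
    for row in rows:
        level = row[variable]
        total[level] += 1
        if row["sex"] == "Female":
            female[level] += 1
    overall = sum(female.values()) / sum(total.values())
    shares = {level: female[level] / total[level] for level in total}
    return overall, shares


# Toy rows, invented for illustration only.
toy_rows = [
    {"occupation": "Adm-clerical", "sex": "Female"},
    {"occupation": "Adm-clerical", "sex": "Female"},
    {"occupation": "Adm-clerical", "sex": "Male"},
    {"occupation": "Craft-repair", "sex": "Male"},
    {"occupation": "Craft-repair", "sex": "Male"},
    {"occupation": "Craft-repair", "sex": "Male"},
]
overall_share, shares = female_share_by(toy_rows, "occupation")
# Occupations whose female share exceeds the overall share (the horizontal line in the plot).
above_average = sorted(level for level, share in shares.items() if share > overall_share)
```

In the toy data only Adm-clerical lies above the overall female share of 1/3.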
The dataset contains sociodemographical information about 32561 US-based individuals.
The dataset mostly contains categorical variables, and having more numerical variables, namely income, would make exploration even more exciting.
I started the exploration by examining the distributions of individual features and their connections with income.
I noticed that, for an unknown reason, there is a disproportionately high number of men.
For many of the features, I was able to identify parts which contain most of the high-income. Some of this income inequality is natural - like in occupations or education, but some - like in sex or race could be a sign of either social inequality in the population or very bad sampling.
Then, I explored relationships between features to better understand distributions and find some interesting patterns, like how occupations differ by gender presence.
Then, I’ve plotted the correlation matrix, which helped understand which relationships between features are stronger than others.
As a result, I saw that most of the features have strong correlations with only a small number of other features. Unlike the other features, income has some correlation with most of the features. Also, the features ‘race’, ‘native_country’ and either ‘marital_status’ or ‘relationship’ carry very little information and could be removed from the model (which I didn’t make here) with only a small loss of predictive power.
I’ve found that while income correlates with sex and relationship, education doesn’t, and neither income nor education correlates with race.
In the last section of the analysis, I’ve found some more interesting patterns and strengthened the relationships that were found earlier.
I didn’t make a model to try to predict the income, and it would be the next logical step if I continue working with this data.
To summarise, the dataset proved to be very interesting, although some of the features present more questions than answers. There is still a lot to do, but from what I’ve explored, I can say that I know a lot about the people who comprise the dataset.